Raghu Arghal, Fade Chen, Niall Dalton, Evgenii Kortukov, Calum McNamara, Angelos Nalmpantis, Moksh Nirvaan, Gabriele Sarti, Mario Giulianelli
Abstract: Understanding whether and how language model agents pursue goals is essential for ensuring the safety of AI systems deployed to act autonomously in the world. In this work, we study goal-directedness in a language model agent, GPT-OSS-20B, as it navigates procedurally generated 2D grid environments. We operationalize goal-directedness behaviourally--through the optimality of an agent's actions and through its robustness to environment perturbations--and representationally--by probing the agent's internal activations for evidence of structured spatial knowledge. Our behavioural evaluation reveals that GPT-OSS-20B generally acts as a goal-directed agent, navigating towards the goal across a range of grid sizes with above-chance optimality. Representationally, linear and MLP probes trained on the agent's residual stream activations at intermediate layers uncover internal representations that partially encode the spatial layout of the environment, including the positions of walls, the goal, and the agent itself. Taken together, our results indicate that GPT-OSS-20B can act as a goal-directed agent through reliance on internal representations that partially but non-trivially encode the spatial features of its environment.
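To make the probing setup concrete, here is a minimal, self-contained sketch of training a linear probe on synthetic stand-ins for residual-stream activations. The dimensions, data, and training loop are illustrative only, not the paper's implementation:

```python
import torch
from torch import nn

# Synthetic placeholders: 256 activation vectors of width d_model, each
# labelled with one of n_classes cell identities. Real probes would use
# activations extracted from the model's residual stream.
d_model, n_classes, n_samples = 64, 4, 256
X = torch.randn(n_samples, d_model)
y = torch.randint(0, n_classes, (n_samples,))

probe = nn.Linear(d_model, n_classes)  # a linear probe is a single layer
opt = torch.optim.Adam(probe.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

for _ in range(100):
    opt.zero_grad()
    loss = loss_fn(probe(X), y)
    loss.backward()
    opt.step()

# Probe accuracy on the training data (held-out data is used in practice).
accuracy = (probe(X).argmax(dim=-1) == y).float().mean()
```

If the probe reaches above-chance accuracy on held-out activations, the probed feature is at least linearly decodable from the residual stream at that layer.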
Paper: arxiv.org/abs/2602.08964
Data and trained probes: huggingface.co/project-telos
```bibtex
@article{arghal-etal-2026-behavioural,
  title={A Behavioural and Representational Evaluation of Goal-Directedness in Language Model Agents},
  author={Raghu Arghal and Fade Chen and Niall Dalton and Evgenii Kortukov and Calum McNamara and Angelos Nalmpantis and Moksh Nirvaan and Gabriele Sarti and Mario Giulianelli},
  year={2026},
  journal={arXiv preprint arXiv:2602.08964},
  url={https://arxiv.org/abs/2602.08964}
}
```

```shell
# Clone the repository
git clone https://github.com/SPAR-Telos/interp
cd interp

# Install with uv (recommended)
uv sync

# Or install with pip
pip install -e .

# For vLLM-based activation extraction (requires GPU)
pip install -e ".[vllm]"
```

Pre-computed trajectories, activations, trained probes, and evaluation results are available on the project-telos HuggingFace organization.
Trained probes:
| Repository | Description |
|---|---|
| cognitive_map_probes | Trained cell identity classification probes |
| distance_probes | Trained A* distance regression probes |
| decoder_probes | Trained decoder probes |
Datasets:
| Repository | Description |
|---|---|
| trajectories_train_single_step | Training trajectories (single step) |
| trajectories_test_full | Test trajectories (full episodes) |
| trajectories_test_full_with_cognitive_map_probes | Test trajectories with probe predictions |
| trajectories_key_door_100 | Key-door environment trajectories |
| trajectories_key_no_door_100 | Key-no-door environment trajectories |
| probes_train_single_step | Prepared probe training data |
| activations_test_full | Extracted test activations |
| activations_key_door_env_100 | Key-door environment activations |
| cognitive_map_probes_results | Cell identity probe evaluation results |
| distance_probes_results | Distance probe evaluation results |
The analysis pipeline has four stages. Each stage uses a CLI command provided by the interp-cli tool. See telos_interp/commands/README.md for full documentation of all commands and options.
To use pre-computed data, download the relevant datasets from the HuggingFace organization above and point the CLI commands to the downloaded directories.
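As a sketch of one way to fetch these artifacts with `huggingface_hub`: the `org/name` repository IDs below are an assumption, combining the project-telos organization with the repository names from the tables above.

```python
from huggingface_hub import snapshot_download

ORG = "project-telos"  # HuggingFace organization linked above

def repo_id_for(name: str) -> str:
    """Assumed 'org/name' repo id for a dataset listed in the tables above."""
    return f"{ORG}/{name}"

def download_dataset(name: str, root: str = "data") -> str:
    """Download one dataset snapshot and return the local directory path."""
    return snapshot_download(
        repo_id=repo_id_for(name),
        repo_type="dataset",
        local_dir=f"{root}/{name}",
    )

# Example (requires network access):
# local_dir = download_dataset("trajectories_test_full")
```

The downloaded directory (e.g. `data/trajectories_test_full`) can then be passed to the CLI commands below in place of locally generated data.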
Extract model activations from trajectory JSON files:
```shell
interp-cli gather_activations \
    --trajectory-paths "data/trajectories/size5/*.json" \
    --output-dir data/activations/size5 \
    --layers all \
    --steps 0 \
    --output-indices -1
```

Format extracted activations into datasets suitable for probe training:
```shell
interp-cli prepare_activations_for_probing \
    --activations-dir data/activations/size5 \
    --trajectories-dir data/trajectories/size5 \
    --probe-type grid_tile \
    --output-indices -1 \
    --balance-classes-per-trajectory
```

Train cell identity classifiers or distance regression probes:
```shell
# Cell identity probe
interp-cli train_cognitive_map_probe \
    --train-data-path data/activations/size5/cognitive_map_activations_*.pt \
    --model-type mlp \
    --hidden-dims "512,256" \
    --num-epochs 100

# Distance regression probe
interp-cli train_distance_probe \
    --train-data-path data/activations/size7/distance_activations_*.pt \
    --model-type mlp \
    --hidden-dims "512,256" \
    --num-epochs 100
```

Example configuration files are provided in configs/.
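As a sketch of what `--model-type mlp --hidden-dims "512,256"` plausibly corresponds to (the actual probe architecture in this codebase may differ), an MLP probe maps activations through two hidden layers to output logits:

```python
import torch
from torch import nn

def build_mlp_probe(d_model: int, hidden_dims: list[int], n_out: int) -> nn.Sequential:
    """Stack a Linear+ReLU pair per hidden dim, then a final Linear head."""
    layers: list[nn.Module] = []
    in_dim = d_model
    for h in hidden_dims:
        layers += [nn.Linear(in_dim, h), nn.ReLU()]
        in_dim = h
    layers.append(nn.Linear(in_dim, n_out))
    return nn.Sequential(*layers)

# --hidden-dims "512,256" would parse to [512, 256]; d_model and n_out
# are placeholders (e.g. n_out = number of grid cells for a 5x5 map).
probe = build_mlp_probe(d_model=1024, hidden_dims=[512, 256], n_out=25)
logits = probe(torch.randn(8, 1024))
```

For a distance regression probe, the same structure with `n_out=1` and a regression loss (e.g. MSE against A* distances) would apply.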
Evaluate probes on held-out data and apply them to generate trajectory-level predictions:
```shell
# Evaluate cell identity probe
interp-cli eval_cognitive_map_probe \
    --trajectories-dir data/trajectories/size5_test \
    --activations-dir data/activations/size5_test \
    --probe-path path/to/cognitive_map_probe.pt \
    --output-indices -1

# Apply probe to trajectories
interp-cli apply_cognitive_map_probe \
    --activations-dir data/activations/size5 \
    --trajectories-dir data/trajectories/size5 \
    --probe-path path/to/cognitive_map_probe.pt \
    --output-dir data/trajectories_with_probes/size5 \
    --layers 20 \
    --steps all \
    --output-indices -1
```

See CONTRIBUTING.md for development setup, code style, and testing instructions.
This project is licensed under the Apache License 2.0. See LICENSE for details.